Linking Prospective and Retrospective Provenance in Scripts
Authors
Abstract
Scripting languages like Python, R, and MATLAB have seen significant use across a variety of scientific domains. To assist scientists in the analysis of script executions, a number of mechanisms, e.g., noWorkflow, have recently been proposed to capture the provenance of script executions. The recorded provenance information can be used, e.g., to trace the lineage of a particular result by identifying the data inputs and the processing steps that were used to produce it. By and large, the provenance information captured for scripts is fine-grained in the sense that it captures data dependencies at the level of individual script statements, and does so for every variable within the script. While useful, the amount of recorded provenance information can be overwhelming for users and cumbersome to use. This suggests the need for abstraction mechanisms that focus attention on the specific parts of the provenance that are relevant for a given analysis. Toward this goal, we propose that fine-grained provenance information recorded as the result of script execution can be abstracted using user-specified, workflow-like views. Specifically, we show how the provenance traces recorded by noWorkflow can be mapped to the workflow specifications generated by YesWorkflow from scripts based on user annotations. We examine the issues in constructing a successful mapping, provide an initial implementation of our solution, and present competency queries illustrating how a workflow view generated from the script can be used to explore the provenance recorded during script execution.
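The idea of abstracting a statement-level trace through annotated blocks can be illustrated with a minimal sketch. It is not the mapping implemented in the paper: the trace records below are a hypothetical simplification (line number, assigned variable, variables read) rather than noWorkflow's actual schema, and the annotation parser only recognizes named `@begin`/`@end` comments, a small subset of what YesWorkflow supports. The helper names `block_ranges`, `block_of`, and `abstract_trace` are invented for illustration.

```python
import re

# A toy script annotated with YesWorkflow-style @begin/@end comments.
SCRIPT = """\
# @begin clean_data
raw = load()
clean = [x for x in raw if x]
# @end clean_data
# @begin summarize
total = sum(clean)
mean = total / len(clean)
# @end summarize
"""

# Hypothetical fine-grained trace: (line, assigned variable, variables read).
TRACE = [
    (2, "raw", []),
    (3, "clean", ["raw"]),
    (6, "total", ["clean"]),
    (7, "mean", ["total", "clean"]),
]


def block_ranges(script):
    """Return {block_name: (first_line, last_line)} from @begin/@end comments."""
    ranges, open_blocks = {}, []
    for lineno, text in enumerate(script.splitlines(), start=1):
        match = re.match(r"\s*#\s*@(begin|end)\s+(\w+)", text)
        if not match:
            continue
        kind, name = match.groups()
        if kind == "begin":
            open_blocks.append((name, lineno))
        else:
            start_name, start = open_blocks.pop()
            ranges[start_name] = (start, lineno)
    return ranges


def block_of(line, ranges):
    """Return the innermost annotated block containing the line, or None."""
    candidates = [
        (end - start, name)
        for name, (start, end) in ranges.items()
        if start <= line <= end
    ]
    return min(candidates)[1] if candidates else None


def abstract_trace(trace, ranges):
    """Collapse variable-level dependencies into block-to-block data flows."""
    defined_in = {}  # variable name -> block that last assigned it
    edges = set()
    for line, target, sources in trace:
        block = block_of(line, ranges)
        for src in sources:
            src_block = defined_in.get(src)
            if src_block is not None and src_block != block:
                edges.add((src_block, src, block))
        defined_in[target] = block
    return sorted(edges)


if __name__ == "__main__":
    for src_block, variable, dst_block in abstract_trace(TRACE, block_ranges(SCRIPT)):
        print(f"{src_block} --{variable}--> {dst_block}")
```

In this sketch, four statement-level records collapse into a single block-level flow, because dependencies that stay inside one annotated block are hidden from the view. Matching trace records to blocks by line range is only one possible strategy and is an assumption of this example.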
Similar resources
Retrospective Provenance Without a Runtime Provenance Recorder
The YesWorkflow (YW) toolkit aims to provide users of scripting languages such as Python, Perl, and R with many of the benefits of scientific workflow automation. YW requires neither the use of a workflow engine nor the overhead of adapting or instrumenting code to run in such a system. Instead, YW enables scientists to annotate their scripts with special comments that reveal the main computati...
Revealing the Detailed Lineage of Script Outputs using Hybrid Provenance
We illustrate how combining retrospective and prospective provenance can yield scientifically meaningful hybrid provenance representations of the computational histories of data produced during a script run. We use scripts from multiple disciplines (astrophysics, climate science, biodiversity data curation, and social network analysis), implemented in Python, R, and MATLAB, to highlight the use...
Facilitating Reproducible Computing via Scientific Workflows -- an Integrated System Approach
Author: Cao, Yuan. Degree: MS. Institution: Purdue University. Degree Received: May 2017. Title: Facilitating Reproducible Computing via Scientific Workflows -- An Integrated System Approach. Major Professor: Yao Liang. Reproducible computing and research are of great importance for scientific investigation in any discipline. This thesis presents a general approach to provenance in the context of workflows fo...
noWorkflow: Capturing and Analyzing Provenance of Scripts
We propose noWorkflow, a tool that transparently captures provenance of scripts and enables reproducibility. Unlike existing approaches, noWorkflow is non-intrusive and does not require users to change the way they work – users need not wrap their experiments in scientific workflow systems, install version control systems, or instrument their scripts. The tool leverages Software Engineering tec...
noWorkflow: a Tool for Collecting, Analyzing, and Managing Provenance from Python Scripts
We present noWorkflow, an open-source tool that systematically and transparently collects provenance from Python scripts, including data about the script execution and how the script evolves over time. During the demo, we will show how noWorkflow collects and manages provenance, as well as how it supports the analysis of computational experiments. We will also encourage attendees to use noWorkf...